Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
Chapter 6 in A Modern Approach to Regression explains that when fitting a multiple regression model, it is important to:
1. Determine whether the proposed regression model is a valid model (i.e., determine whether it provides an adequate fit to the data). The main tools we will use to validate regression assumptions are plots involving standardized residuals and/or fitted values. We shall see that these plots enable us to assess visually whether the assumptions are being violated and, under certain conditions, point to what should be done to overcome these violations. We shall also consider a tool called marginal model plots, which have wider application than residual plots.
2. Determine which (if any) of the data points have predictor values that have an unusually large effect on the estimated regression model. (Recall that such points are called leverage points.)
3. Determine which (if any) of the data points are outliers, that is, points which do not follow the pattern set by the bulk of the data, when one takes into account the given model.
4. Assess the effect of each predictor variable on the response variable, having adjusted for the effects of the other predictor variables, using added-variable plots.
5. Assess the extent of collinearity among the predictor variables using variance inflation factors.
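Each of these checks maps onto a standard R tool. A minimal base-R sketch, using a stand-in model fitted on the built-in `mtcars` data purely for illustration (items 4 and 5 are typically done with `car::avPlots` and `car::vif`):

```r
# Stand-in model on a built-in data set, for illustration only
lmod <- lm(mpg ~ wt + hp, data = mtcars)

# (1) Validity: standardized residuals vs. fitted values should look random
std_res <- rstandard(lmod)
plot(fitted(lmod), std_res)

# (2) Leverage: a common rule of thumb flags hat values above 2(p + 1)/n
p <- length(coef(lmod)) - 1
n <- nobs(lmod)
high_leverage <- which(hatvalues(lmod) > 2 * (p + 1) / n)

# (3) Outliers: |standardized residual| > 2 is a common cutoff
outliers <- which(abs(std_res) > 2)
```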
# Drop the INDEX column (colClasses "NULL") and shift every value up by 1 so
# that all variables are strictly positive (needed later for Box-Cox)
moneyball_training_data <- read.csv("~/Documents/DATA 621/data_621_hw_1/data/moneyball-training-data.csv",
                                    colClasses = c("NULL", rep(NA, 16))) + 1
summary(moneyball_training_data)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 1.00 Min. : 892 Min. : 70.0 Min. : 1.00
## 1st Qu.: 72.00 1st Qu.:1384 1st Qu.:209.0 1st Qu.: 35.00
## Median : 83.00 Median :1455 Median :239.0 Median : 48.00
## Mean : 81.79 Mean :1470 Mean :242.2 Mean : 56.25
## 3rd Qu.: 93.00 3rd Qu.:1538 3rd Qu.:274.0 3rd Qu.: 73.00
## Max. :147.00 Max. :2555 Max. :459.0 Max. :224.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 1.0 Min. : 1.0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 43.0 1st Qu.:452.0 1st Qu.: 549.0 1st Qu.: 67.0
## Median :103.0 Median :513.0 Median : 751.0 Median :102.0
## Mean :100.6 Mean :502.6 Mean : 736.6 Mean :125.8
## 3rd Qu.:148.0 3rd Qu.:581.0 3rd Qu.: 931.0 3rd Qu.:157.0
## Max. :265.0 Max. :879.0 Max. :1400.0 Max. :698.0
## NA's :102 NA's :131
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 1.0 Min. :30.00 Min. : 1138 Min. : 1.0
## 1st Qu.: 39.0 1st Qu.:51.50 1st Qu.: 1420 1st Qu.: 51.0
## Median : 50.0 Median :59.00 Median : 1519 Median :108.0
## Mean : 53.8 Mean :60.36 Mean : 1780 Mean :106.7
## 3rd Qu.: 63.0 3rd Qu.:68.00 3rd Qu.: 1684 3rd Qu.:151.0
## Max. :202.0 Max. :96.00 Max. :30133 Max. :344.0
## NA's :772 NA's :2085
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 1.0 Min. : 1.0 Min. : 66.0 Min. : 53.0
## 1st Qu.: 477.0 1st Qu.: 616.0 1st Qu.: 128.0 1st Qu.:132.0
## Median : 537.5 Median : 814.5 Median : 160.0 Median :150.0
## Mean : 554.0 Mean : 818.7 Mean : 247.5 Mean :147.4
## 3rd Qu.: 612.0 3rd Qu.: 969.0 3rd Qu.: 250.2 3rd Qu.:165.0
## Max. :3646.0 Max. :19279.0 Max. :1899.0 Max. :229.0
## NA's :102 NA's :286
The statistics reveal that several of the observations have NA values:
which(colSums(is.na(moneyball_training_data)) > 0)
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## 7 8 9 10
## TEAM_PITCHING_SO TEAM_FIELDING_DP
## 14 16
We can either ignore these predictors entirely or ignore the observations that contain any NA’s.
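Both options are one-liners; a sketch on a toy stand-in frame (with the real data, replace `df` with `moneyball_training_data`):

```r
# Toy stand-in for the training data, for illustration only
df <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))

complete_rows <- na.omit(df)                                  # drop observations with any NA
complete_cols <- df[, colSums(is.na(df)) == 0, drop = FALSE]  # drop predictors with any NA
```

Dropping rows keeps every predictor but discards many observations (TEAM_BATTING_HBP alone is missing 2085 values); dropping columns keeps every observation but loses the six predictors listed above.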
The statistics also show that TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_BB have maximum values that are extremely high relative to their medians and third quartiles.
for (predictor in colnames(moneyball_training_data)) {
  print(ggplot(moneyball_training_data, aes(x = .data[[predictor]])) +
          geom_boxplot() +
          coord_flip() +
          xlab(predictor))
}
## Warning: Removed 102 rows containing non-finite values (stat_boxplot).
## Warning: Removed 131 rows containing non-finite values (stat_boxplot).
## Warning: Removed 772 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2085 rows containing non-finite values (stat_boxplot).
## Warning: Removed 102 rows containing non-finite values (stat_boxplot).
## Warning: Removed 286 rows containing non-finite values (stat_boxplot).
### Response vs. Predictor Plots
for (predictor in colnames(moneyball_training_data)) {
  print(ggplot(moneyball_training_data, aes(x = .data[[predictor]], y = TARGET_WINS)) +
          geom_point() +
          xlab(predictor))
}
## Warning: Removed 102 rows containing missing values (geom_point).
## Warning: Removed 131 rows containing missing values (geom_point).
## Warning: Removed 772 rows containing missing values (geom_point).
## Warning: Removed 2085 rows containing missing values (geom_point).
## Warning: Removed 102 rows containing missing values (geom_point).
## Warning: Removed 286 rows containing missing values (geom_point).
TEAM_BATTING_H: the most hits by a team in a season is 1783, so anything over that should be removed or imputed: https://www.baseball-almanac.com/recbooks/hits_records_mlb_teams.shtml

TEAM_PITCHING_H: by the same record, no pitching staff should allow more than 1783 hits in a season, so anything over that should be removed or imputed: https://www.baseball-almanac.com/recbooks/hits_records_mlb_teams.shtml

TEAM_PITCHING_SO: the most strikeouts by a team's pitchers in a season is 1595, so anything above that should be removed or imputed: https://www.baseball-almanac.com/recbooks/rb_strike2.shtml

TEAM_FIELDING_E: the most errors by a team in a season is 886, so anything above that should be removed or imputed: https://www.baseball-fever.com/forum/general-baseball/statistics-analysis-sabermetrics/2403-team-errors-in-a-season
# Note: filter() (from dplyr) also drops rows where any condition evaluates to NA
moneyball_training_data <- filter(moneyball_training_data,
                                  TEAM_BATTING_H < 1783 & TEAM_PITCHING_H < 1783 &
                                  TEAM_PITCHING_SO < 1595 & TEAM_FIELDING_E < 886)
for (predictor in colnames(moneyball_training_data)) {
  print(ggplot(moneyball_training_data, aes(x = .data[[predictor]], y = TARGET_WINS)) +
          geom_point() +
          xlab(predictor))
}
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 377 rows containing missing values (geom_point).
## Warning: Removed 1583 rows containing missing values (geom_point).
## Warning: Removed 74 rows containing missing values (geom_point).
# melt() from the reshape2 package reshapes the data to long format for faceting
ggplot(data = melt(moneyball_training_data, "TARGET_WINS"), aes(value)) +
  geom_histogram() +
  facet_wrap(.~variable, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2036 rows containing non-finite values (stat_bin).
It looks like TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR are bimodal. We could create two models: one with the bimodal variables included and one without them. But let’s continue…
In total there are 6 columns with missing values:

- Strikeouts by batters (TEAM_BATTING_SO, 5% missing): a true value of zero is highly unlikely; impute with the median or a regression model.
- Stolen bases (TEAM_BASERUN_SB, 6% missing): stolen bases weren’t tracked officially until 1887, so some of the missing data could be from 1871-1886. We could impute those values with the median or a regression model.
- Caught stealing (TEAM_BASERUN_CS, 34% missing): this statistic was not tracked until 1951. We might have to ignore this variable.
- Batters hit by pitch (TEAM_BATTING_HBP, 92% missing): replace with 0.
- Strikeouts by pitchers (TEAM_PITCHING_SO, 4% missing): highly unlikely to be truly zero; impute with the median or a regression model.
- Double plays (TEAM_FIELDING_DP, 12% missing): highly unlikely to be truly zero; impute with the median or a regression model.
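The percentages above can be verified directly; a sketch on a toy stand-in frame (with the real data, replace `df` with `moneyball_training_data`):

```r
# Toy frame: column a is 50% missing, column b is complete
df <- data.frame(a = c(1, NA, NA, 4), b = 1:4)

na_pct <- round(100 * colMeans(is.na(df)), 1)  # percent missing per column
na_pct[na_pct > 0]                             # only the columns with missing values
```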
We generate the correlation plot as shown below.
corrplot(cor(moneyball_training_data, use = "na.or.complete"), method = 'number', type = 'lower', diag = FALSE, number.cex = 0.5, tl.cex = 0.5)
Several variable pairs show a correlation of 1, i.e., they are essentially duplicates of one another. The variance inflation factors for these variables are also extremely high (see the “Linear Model with all Predictors” section).
This link points out that if less than 5% of the data is missing, then it is okay to ignore it.
This link points out that if more than 5% of the data is missing, then we might have to leave the variable out. I don’t think that is necessarily the right thing to do, however; too much missing data will introduce bias, which will throw off our predictions. This link also says:
“The mice package in R, helps you imputing missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing datapoint.”
This link says that we can impute with the means/medians if the missing values account for only 5% of the sample. Peng et al. (2006) suggest that mean imputation is permissible only if no more than 20% of the data is missing. I think we should impute because MICE draws from a distribution, so the prediction is more robust. We can use the mice package.
Unconditional mean imputation results in an artificial reduction in variability, because you are imputing values at the center of the variable’s distribution (see “3 problems with mean imputation”).
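The variance shrinkage is easy to demonstrate with simulated data; a self-contained sketch:

```r
set.seed(1)
x <- rnorm(1000, mean = 50, sd = 10)  # "true" complete data
x_missing <- x
x_missing[sample(1000, 300)] <- NA    # knock out 30% of the values

# Unconditional mean imputation: every NA becomes the observed mean
x_imputed <- ifelse(is.na(x_missing), mean(x_missing, na.rm = TRUE), x_missing)

var(x)          # variance of the complete data
var(x_imputed)  # noticeably smaller: 30% of the values now sit at the center
```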
We’ll probably use predictive mean matching (“pmm”) because it is good for quantitative variables. mice circumvents the shrinking of errors by fitting multiple regression models. The variability across the different imputed values gives us a higher, but more correct, standard error. Uncertainty is inherent in imputation, so we account for it through multiple imputed values (Source).
As long as the dataset is large enough, we can use MICE’s PMM. Also:
“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data and the missing data are not MNAR’ (Marshall et al. 2010)” (Source)
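A hedged sketch of PMM with the mice package (assuming mice is installed; `nhanes` is a small demo data set that ships with it):

```r
library(mice)

# m = 5 imputed data sets, predictive mean matching for every incomplete column
imp <- mice(nhanes, method = "pmm", m = 5, seed = 123, printFlag = FALSE)
completed <- complete(imp, 1)  # extract the first imputed data set

anyNA(completed)  # FALSE: every missing value has been filled in
```

With the real data we would pass `moneyball_training_data` instead of `nhanes`, and compare or pool the m imputed sets rather than just taking the first.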
One of the variables, TEAM_BASERUN_CS, is missing 34% of its values. Let’s create two models: one with this variable kept in and imputed, and another with it taken out. If the p-values are low for both models, we can then use the evaluation data and see which model has the lower RMSE…
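A sketch of that comparison; `rmse()` is a small helper defined here, not something provided by the assignment, and `eval_data`, `m1`, and `m2` are hypothetical names:

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# With two fitted models m1 (TEAM_BASERUN_CS imputed) and m2 (dropped),
# the comparison on the evaluation data would look like:
#   rmse(eval_data$TARGET_WINS, predict(m1, newdata = eval_data))
#   rmse(eval_data$TARGET_WINS, predict(m2, newdata = eval_data))

rmse(c(80, 90, 100), c(82, 88, 101))  # sqrt(mean(c(4, 4, 1))) = sqrt(3)
```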
Multiple Imputation in R - Multiple imputation doesn’t like variables that are highly correlated with each other. We might have to take out some of the highly correlated variables…But taking out the highly correlated variables means that we might be leaving out some important information inherent in the data…So we might do a linear combination of the correlated variables in order to reduce the correlation….
I think that we should first transform toward a normal distribution using Box-Cox, and then impute.
Top of page 158 in A Modern Approach to Regression describes the following for an example on food prices:
“Assuming that condition (6.6) holds we next look at plots of standardized residuals against each predictor (see Figure 6.2). The random nature of these plots is indicative that model (6.8) is a valid model for the data.”
Therefore, let’s fit a linear model with all of the predictors and view the plots of standardized residuals against each predictor.
Note that when using all of the variables, observations with any NAs are omitted.
lmod <- lm(TARGET_WINS ~ ., moneyball_training_data)
standard_res <- rstandard(lmod)

# lm() silently drops rows containing NAs, so plot against na.omit() of the
# data to keep the x values aligned with the residuals
for (predictor in colnames(moneyball_training_data[-1])) {
  plot(na.omit(moneyball_training_data)[[predictor]],
       standard_res,
       xlab = predictor,
       ylab = "standardized_residuals")
}
TBD
vif(lmod)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1.171824e+05 1.685623e+00 1.302198e+00 3.074804e+05
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1.962853e+05 1.941752e+05 1.950069e+00 1.914415e+00
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1.096334e+00 1.160417e+05 3.069624e+05 1.964039e+05
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.946316e+05 1.256819e+00 1.097611e+00
The plot for TEAM_BATTING_2B doesn’t look random; it looks like there is an outlier on the left side of the plot. It’s hard to tell whether the other plots follow a pattern or not…
plot(na.omit(moneyball_training_data)$TARGET_WINS, predict(lmod), xlab = 'y', ylab = 'y_hat')
abline(a = 0, b = 1)
sumary(lmod)  # sumary() (one "m") is the compact summary from the faraway package
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.410051 19.704264 3.1166 0.002139
## TEAM_BATTING_H 1.913476 2.761394 0.6929 0.489267
## TEAM_BATTING_2B 0.026388 0.030290 0.8712 0.384844
## TEAM_BATTING_3B -0.101176 0.077507 -1.3054 0.193477
## TEAM_BATTING_HR -4.843707 10.508511 -0.4609 0.645420
## TEAM_BATTING_BB -4.459691 3.636241 -1.2265 0.221675
## TEAM_BATTING_SO 0.341963 2.598759 0.1316 0.895462
## TEAM_BASERUN_SB 0.033044 0.028673 1.1524 0.250708
## TEAM_BASERUN_CS -0.011044 0.071431 -0.1546 0.877303
## TEAM_BATTING_HBP 0.082473 0.049600 1.6628 0.098152
## TEAM_PITCHING_H -1.890957 2.760946 -0.6849 0.494317
## TEAM_PITCHING_HR 4.930432 10.506645 0.4693 0.639462
## TEAM_PITCHING_BB 4.510891 3.633720 1.2414 0.216120
## TEAM_PITCHING_SO -0.373645 2.597052 -0.1439 0.885767
## TEAM_FIELDING_E -0.172042 0.041404 -4.1552 5.076e-05
## TEAM_FIELDING_DP -0.108192 0.036541 -2.9609 0.003494
##
## n = 191, p = 16, Residual SE = 8.46704, R-Squared = 0.55
plot(lmod)
The fitted-values plot above indicates that \(Y\) and \(\hat{Y}\) might not be linearly related: the slope looks like it should be smaller and the intercept higher. We should therefore try a Box-Cox transformation to overcome this nonlinearity.
TBD
Try replacing the 0 values with very small values (e.g., 1e-6), which will allow you to perform the Box-Cox transformation.
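A self-contained sketch with `MASS::boxcox()` on toy data (the response must be strictly positive, hence the nudge):

```r
library(MASS)

x <- 1:8
y <- c(0, 2, 5, 9, 14, 20, 27, 35)  # toy response containing a zero
y_pos <- ifelse(y == 0, 1e-6, y)    # nudge zeros so the log-likelihood is defined

bc <- boxcox(lm(y_pos ~ x), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]     # lambda that maximizes the profile likelihood
```

Note that 1e-6 turns a zero into an extreme outlier on the log scale, so it is worth comparing the chosen lambda against the one obtained by simply dropping the zero rows.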
Let’s try simple stepwise selection: both directions, forward-only, and backward-only.
step.model <- stepAIC(lmod, direction = "both", trace = FALSE)  # stepAIC() is from the MASS package
summary(step.model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = moneyball_training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2248 -5.6294 -0.0212 5.0439 21.3065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.01842 19.13272 3.241 0.001413 **
## TEAM_BATTING_H 0.02541 0.01009 2.518 0.012648 *
## TEAM_BATTING_HBP 0.08712 0.04852 1.796 0.074211 .
## TEAM_PITCHING_HR 0.08945 0.02394 3.736 0.000249 ***
## TEAM_PITCHING_BB 0.05672 0.00940 6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136 0.00728 -4.308 2.68e-05 ***
## TEAM_FIELDING_E -0.17218 0.03970 -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904 0.03516 -3.386 0.000869 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.422 on 183 degrees of freedom
## (1583 observations deleted due to missingness)
## Multiple R-squared: 0.5345, Adjusted R-squared: 0.5167
## F-statistic: 30.02 on 7 and 183 DF, p-value: < 2.2e-16
# Forward selection starting from the full model has nothing to add, so this
# simply returns the full model
step.model <- stepAIC(lmod, direction = "forward", trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = moneyball_training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8708 -5.6564 -0.0599 5.2545 22.9274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.41005 19.70426 3.117 0.00214 **
## TEAM_BATTING_H 1.91348 2.76139 0.693 0.48927
## TEAM_BATTING_2B 0.02639 0.03029 0.871 0.38484
## TEAM_BATTING_3B -0.10118 0.07751 -1.305 0.19348
## TEAM_BATTING_HR -4.84371 10.50851 -0.461 0.64542
## TEAM_BATTING_BB -4.45969 3.63624 -1.226 0.22167
## TEAM_BATTING_SO 0.34196 2.59876 0.132 0.89546
## TEAM_BASERUN_SB 0.03304 0.02867 1.152 0.25071
## TEAM_BASERUN_CS -0.01104 0.07143 -0.155 0.87730
## TEAM_BATTING_HBP 0.08247 0.04960 1.663 0.09815 .
## TEAM_PITCHING_H -1.89096 2.76095 -0.685 0.49432
## TEAM_PITCHING_HR 4.93043 10.50664 0.469 0.63946
## TEAM_PITCHING_BB 4.51089 3.63372 1.241 0.21612
## TEAM_PITCHING_SO -0.37364 2.59705 -0.144 0.88577
## TEAM_FIELDING_E -0.17204 0.04140 -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819 0.03654 -2.961 0.00349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.467 on 175 degrees of freedom
## (1583 observations deleted due to missingness)
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5116
## F-statistic: 14.27 on 15 and 175 DF, p-value: < 2.2e-16
step.model <- stepAIC(lmod, direction = "backward", trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = moneyball_training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2248 -5.6294 -0.0212 5.0439 21.3065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.01842 19.13272 3.241 0.001413 **
## TEAM_BATTING_H 0.02541 0.01009 2.518 0.012648 *
## TEAM_BATTING_HBP 0.08712 0.04852 1.796 0.074211 .
## TEAM_PITCHING_HR 0.08945 0.02394 3.736 0.000249 ***
## TEAM_PITCHING_BB 0.05672 0.00940 6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136 0.00728 -4.308 2.68e-05 ***
## TEAM_FIELDING_E -0.17218 0.03970 -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904 0.03516 -3.386 0.000869 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.422 on 183 degrees of freedom
## (1583 observations deleted due to missingness)
## Multiple R-squared: 0.5345, Adjusted R-squared: 0.5167
## F-statistic: 30.02 on 7 and 183 DF, p-value: < 2.2e-16